Local and global quality of service shaper on ingress in a distributed system

ABSTRACT

A distributed computing system, such as may be used to implement an electronic trading system, controls inbound message flow rates. Limiting a per-client or per-connection inbound message rate also helps ensure fair provisioning of computing resources, so that a single client's excessive use of resources cannot overwhelm the system to such an extent that it prevents other clients from interacting with the distributed system. It is also desirable to have system-wide control of the overall inbound message rate across all client connections. Such system-wide control ensures that the distributed system as a whole can maintain the required levels of service, including offering a predictable level of access for all clients.

BACKGROUND

Technical Field

This patent application relates to connected devices, and more particularly to controlling both sustained and burst message rates.

Background

The financial instrument trading systems currently in widespread use in the major stock exchanges allow traders to submit orders and receive confirmations, market data, and other information, electronically, via communications networks. The typical electronic trading system includes a matching engine, typically residing within a central server, and a plurality of gateways that provide access to the matching engine, as well as other distributed processors. The typical order process can be as follows: request messages representing orders (e.g., bid orders and/or ask orders) are received, as sent from client devices (e.g., trader terminals operated by human users or servers executing automated trading algorithms). An order acknowledgement is then typically returned to the client devices via the gateway that forwarded the request. The exchange may perform additional processing before the order processing acknowledgement is returned to the client device.

The exchange system may also disseminate information related to the order message, either in the same form as received or otherwise, to other systems to generate market data output.

A “queue”, in the context of communications or data processing, can be thought of as a temporary storage device. A data source pushes data onto the queue. The data sits idly in the queue until a data consumer is ready to pop data from the queue.
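
As an informal illustration only (not part of the claimed system), such a queue can be modeled as a simple first-in, first-out buffer; the Python snippet below is a minimal sketch using illustrative names:

    from collections import deque

    queue = deque()   # temporary storage between a data source and a data consumer

    # Producer side: push data onto the queue.
    queue.append("message 1")
    queue.append("message 2")

    # Consumer side: pop data when ready; items leave in arrival order.
    while queue:
        item = queue.popleft()
        print("consumed", item)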

In data communications, “flow control” is the process of managing the rate of data transmission between two nodes. Flow control is used to prevent a fast sender from overwhelming a slow receiver. It provides a mechanism for the receiver to control the transmission speed, so that the receiving node is not overwhelmed with data. Flow control can involve controlling a “sustained rate”, such as an average amount of data transmitted over time, or a “burst rate”, such as some peak data rate experienced for a short period of time.

“Configuring Queuing and Flow Control”, in Cisco Nexus 5000 Series NX-OS Quality of Service Configuration Guide, Release 5.2(1)N1(1), Apr. 3, 2016, is an example of “per-connection” flow control. An ingress Quality of Service (QoS) policy may be applied to an Ethernet interface to guarantee bandwidth for a specified traffic class. Buffer space, “no drop” thresholds, and other flow control parameters may be set for each connection.

“Hierarchy Virtual Queue Based Flow Control in LTE/SAE”, 2010 Second International Conference on Future Networks, IEEE, Mar. 30, 2010, is an approach to flow control in a wireless network that associates a hierarchy of “virtual queues” with “real queues”. Note that flow control may be implemented at three “levels”—UE (mobile handset), Cell, and eNBs. While virtual queues control flow at their respective levels, there does not, however, appear to be any suggestion of “global” control via a device through which all message traffic passes before reaching a set of compute nodes.

Pre-grant Publication US2012/0195203 (Juniper) describes techniques for flow control using multi-staged queues. However, the “multi-staged queues” are located within a given network device, which may tend to adversely impact a latency-sensitive design.

SUMMARY OF PREFERRED EMBODIMENTS

As described herein, preferred embodiments of a distributed computing system, such as an electronic trading system, control inbound message flow rates.

More particularly, in some distributed computing environments, it is desirable to limit the rate at which messages can be received into the system by a given client (or a given connection). This may be useful, for example, to prevent the communications link(s) between the distributed system and outside client(s) from becoming saturated and/or to prevent overloading the distributed system. Limiting the per-client inbound message rate also helps ensure fair provisioning of computing resources, so that a single client's excessive use of resources cannot overwhelm the system to such an extent that it prevents other clients from interacting with the distributed system.

In addition to controlling the message ingress rate on a per-client (or per-connection) basis, it may also be desirable to have system-wide control of the overall ingress rate into the distributed system across all client connections. This system-wide control ensures that the distributed system as a whole can maintain the required levels of service, including offering a predictable level of access for all clients.

Accordingly, a distributed data processing system or a corresponding method may control inbound message flow to a plurality of compute nodes and a system-level node. In such a system, each of a plurality of gateway nodes receives messages from one or more client connections, controls a sustained rate and/or burst rate of the messages on a per-client or per-connection basis, and then forwards the messages to one or more compute nodes. A system-level node receives the messages from the gateway nodes, controlling the sustained flow and/or burst rate on a per-gateway-node basis before forwarding the messages to the compute nodes. As a result, the system-level node also controls a system-wide message flow rate.
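
The following Python sketch is one minimal way to picture this two-tier arrangement; the class and method names are hypothetical, and the rate check is left as a placeholder (a token bucket, described later, is one concrete choice):

    class ComputeNode:
        def process(self, msg):
            print("compute node handling", msg)

    class RateLimiter:
        """Placeholder for a sustained/burst check (e.g., a token bucket)."""
        def allow(self, msg) -> bool:
            return True   # rate-limiting details are elaborated later in the text

    class SystemLevelNode:
        def __init__(self, compute_nodes):
            self.compute_nodes = compute_nodes
            self.per_gateway = {}            # gateway -> RateLimiter

        def on_message(self, gateway, msg):
            limiter = self.per_gateway.setdefault(gateway, RateLimiter())
            if limiter.allow(msg):           # per-gateway (and system-wide) control
                for node in self.compute_nodes:
                    node.process(msg)

    class GatewayNode:
        def __init__(self, system_node):
            self.system_node = system_node
            self.per_connection = {}         # connection id -> RateLimiter

        def on_message(self, conn_id, msg):
            limiter = self.per_connection.setdefault(conn_id, RateLimiter())
            if limiter.allow(msg):           # per-client / per-connection control
                self.system_node.on_message(self, msg)

    # Example: one gateway forwarding a client message through the system-level node.
    system = SystemLevelNode([ComputeNode()])
    gateway = GatewayNode(system)
    gateway.on_message("conn-1", "new order")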

The system may be used to implement an electronic trading system where the messages are electronic trading messages. In such an embodiment, the compute nodes receive the electronic trading messages from the gateway nodes and a sequencer node, then operate on the electronic trading messages to perform an electronic trading function and generate a response message that is in turn returned to one or more clients through the one or more gateways.

In other aspects, the message flow rate may be further controlled on the system-wide basis by providing feedback to one or more of the gateways. Feedback can be supplied in a number of ways, such as by lowering a window size on a per-connection basis for all connections from the system-level node to the gateway nodes; or lowering a burst and sustained rate configured in a per-connection token bucket for all connections from the system-level node to the gateway nodes; or pausing a respective gateway.

In still other embodiments, the sustained and/or burst rate may be further controlled on a per-client or per-connection basis by providing feedback from the respective gateway to a respective client or connection. Similar to the system-level control, applying feedback can involve lowering a window size on a per-client or per-connection basis for all client connections into the gateway nodes, lowering a burst and sustained rate configured in a per-client or per-connection token bucket for all client connections into the gateway nodes, or pausing the client or connection.

Pausing a connection may involve setting a window size to zero for all clients or connections into the respective gateway, not adding new messages to a per-client or a per-connection FIFO for the respective gateway, or not servicing messages from a per-client or a per-connection FIFO queue.

In yet other aspects, a sustained flow rate or burst rate may be controlled by queuing the messages in a plurality of queues, then feeding the messages from the queues to a plurality of token buckets, and selecting messages from the token buckets.

BRIEF DESCRIPTION OF THE DRAWINGS

Additional novel features and advantages of the approaches discussed herein are evident from the text that follows and the accompanying drawings, where:

FIG. 1 is a high level block diagram of a distributed electronic trading system.

FIG. 2 is a more detailed view of a system component such as a gateway or compute node.

FIG. 3 shows an example of flow control at a point of client connection, such as within a gateway.

FIG. 4 shows an example of global (system-level) flow control.

FIG. 5 is a more detailed view of flow control within the gateway.

FIG. 6 is a more detailed view of flow control at a global (system) level.

FIG. 7 is an example using TCP window size to control backpressure at the gateway.

FIG. 8 is an example using TCP window size to control backpressure at the system level.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT(S)

System Overview

Example embodiments disclosed herein relate to a high-speed electronic trading system that provides a market where orders to buy and sell financial instruments (such as stocks, bonds, commodities, futures, options, and the like) are traded among market participants (such as traders and brokers). The electronic trading system exhibits low latency, fairness, fault tolerance, and other features more fully described below.

The electronic trading system is primarily responsible for “matching” orders to one another. In one example, an offer to “buy” an instrument is matched to a corresponding counteroffer to “sell”. The matched offer and counteroffer must at least partially satisfy the desired price, with any residual unsatisfied quantity passed to another suitable counterorder. Matched orders are then paired and the trade is executed.

Any wholly unsatisfied or partially satisfied orders are maintained in a data structure referred to as an “order book”. The retained information regarding unmatched orders can be used by the matching engine to satisfy subsequent orders. An order book is typically maintained for each instrument and generally defines or otherwise represents the state of the market for that particular product. It may include, for example, the recent prices and quantities at which market participants have expressed a willingness to buy or sell.

The results of matching may also be made visible to market participants via streaming data services referred to as market data feeds. A market data feed typically includes individual messages that carry the pricing for each traded instrument, and related information such as volume and other statistics.

FIG. 1 illustrates an example electronic trading system 100 that includes a number of gateways 120-1, 120-2, . . . , 120-g (collectively referred to as gateways 120), a set of core compute nodes 140-1, 140-2, . . . , 140-c (collectively, the core compute nodes 140 or compute nodes 140), and one or more sequencers 150-1, 150-2, . . . , 150-s (collectively, the sequencers 150). In some embodiments, the gateways 120, core compute nodes 140, and sequencers 150 are thus considered to be nodes in electronic trading system 100. As will be described in more detail below, in one embodiment, the gateways 120, compute nodes 140 and sequencers 150 are directly connected to one another, preferably via low latency, dedicated connections 180.

The term “peer” in relation to the discussion of the system 100 refers to another device that generally serves the same function (e.g., “gateway” vs. “core compute node” vs. “sequencer”) in electronic trading system 100. For example, gateways 120-2, . . . , 120-g are the peers for gateway 120-1, core compute nodes 140-2, . . . , 140-c are the peers for core compute node 140-1, and sequencers 150-2, . . . , 150-s are the peers for sequencer 150-1.

The electronic trading system 100 processes orders from and provides related information to one or more participant computing devices 130-1, 130-2, . . . , 130-p (collectively, the participant devices 130). Participant devices 130 interact with the system 100, and may be one or more personal computers, tablets, smartphones, servers, or other data processing devices configured to display and receive trade order information. The participant devices 130 may be operated by a human via a graphical user interface (GUI), or they may be operated via high-speed automated trading methods running on some physical or virtual data processing platform.

Each participant device 130 may exchange messages with (that is, send messages to and receive messages from) the electronic trading system 100 via connections established with a gateway 120. While FIG. 1 illustrates each participant device 130 as being connected to electronic trading system 100 via a single connection to a gateway 120, it should be understood that a participant device 130 may be connected to electronic trading system 100 over multiple connections to one or more gateway devices 120.

Note that, while each gateway 120-1 may serve a single participant device 130, it typically serves multiple participant devices 130.

The compute nodes 140-1, 140-2, . . . , 140-c (also referred to herein as matching engines 140 or compute engines 140) provide the matching functions described above and may also generate outgoing messages to be delivered to one or more participant devices 130. Each compute node 140 is a high-performance data processor and typically maintains one or more data structures to search and maintain one or more order books 145-1, . . . , 145-b. An order book 145-1 may be maintained, for example, for each instrument for which the core compute node 140-1 is responsible. One or more of the compute nodes 140 and/or one or more of the gateways 120 may also provide market data feeds 147. Market data feeds 147 may be broadcast (for example, multicast) to subscribers, which may be participant devices 130 or any other suitable computing devices.

Some outgoing messages generated by core compute nodes 140 may be synchronous, that is, generated directly by a core compute node 140 in response to one or more incoming messages received from one or more participant devices 130, such as an outgoing “acknowledgement message” or “execution message” in response to a corresponding incoming “new order” message. In some embodiments, however, at least some outgoing messages may be asynchronous, initiated by the trading system 100, for example, certain “unsolicited” cancel messages and “trade break” or “trade bust” messages.

Distributed computing environments, such as the electronic trading system 100, can be configured with multiple matching engines operating in parallel on multiple compute nodes 140.

The sequencers 150 ensure that the proper sequence of any order-dependent operations is maintained. To ensure that operations on incoming messages are not performed out of order, incoming messages received at one or more gateways 120, for example, a new trade order message from one of participant devices 130, typically must then pass through at least one sequencer 150 in which they are marked with a sequence identifier. That identifier may be a unique, monotonically increasing value which is used in the course of subsequent processing throughout the distributed system 100 (e.g., electronic trading system 100) to determine the relative ordering among messages and to uniquely identify messages throughout electronic trading system 100. It should be understood, however, that while unique, the identifier is not limited to a monotonically increasing or decreasing value. Once sequenced, the marked incoming messages, that is, the sequence-marked messages, are typically then forwarded by sequencer(s) 150 to other downstream compute nodes 140 to perform potentially order-dependent processing on the messages.
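
A minimal sketch of this sequencing step, assuming an in-memory counter supplies the unique, monotonically increasing identifier (class and method names are illustrative only, not the actual sequencer implementation):

    import itertools

    class Sequencer:
        def __init__(self, downstream_nodes):
            self._counter = itertools.count(1)   # unique, monotonically increasing
            self.downstream_nodes = downstream_nodes

        def sequence(self, message):
            """Mark an inbound message and forward the sequence-marked copy."""
            marked = {"seq": next(self._counter), "payload": message}
            for node in self.downstream_nodes:
                node.handle(marked)              # order-dependent processing uses "seq"
            return marked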

In some embodiments, messages may also flow in the other direction, that is, from a core compute node 140 to one or more of the participant devices 130, passing through one or more of the gateways 120. Such outgoing messages generated by a core compute node 140 may also be order-dependent, and accordingly may also typically first pass through a sequencer 150 to be marked with a sequence identifier. The sequencer 150 may then forward the marked response message to the gateways 120 in order to pass on to participant devices 130 in a properly deterministic order.

The use of a sequencer 150 to generate unique sequence numbers ensures the correct ordering of operations is maintained throughout the distributed system 100, regardless of which compute node or set of compute nodes 140 processes the messages. This approach provides “state determinism,” enabling fault tolerance, high availability, and disaster recoverability.

It may also be important for a generating node (i.e., a node introducing a new message into the electronic trading system 100, for example by generating a new message and/or by forwarding a message received from a participant device 130) and its peer nodes to receive the sequence number assigned to that message. Receiving the sequence number for a message it generated may be useful to the generating node and its peer nodes not only for processing messages in order according to their sequence numbers, but also to correlate the message generated by the node with the message's identifier that is used throughout the rest of the electronic trading system 100. A subsequent message generated within the electronic trading system 100, while also being assigned its own sequence number, may yet reference one or more sequence numbers of related preceding messages. Accordingly, a node may need to quickly reference (by sequence number) a message the node had itself previously generated, because, for example, the sequence number of the message the node had generated was referenced in a subsequent message.

In some embodiments, the generating node may first send a message to the sequencer 150 and wait to receive the sequence number from the sequencer before the generating node forwards the message to other nodes in electronic trading system 100.

In alternate example embodiments, to avoid at least one hop, which could add undesirable increased latency within electronic trading system 100, after receiving the un-sequenced message from the generating node, sequencer 150 may not only send a sequenced version of the message (e.g., a sequence-marked message) to destination nodes, but may also send, substantially simultaneously, a sequenced version of the message back to the sending node and its peers. For example, after assigning a sequence number to an incoming message sent from the gateway 120-1 to core compute nodes 140, the sequencer 150 may not only forward the sequenced version of the message to the core compute nodes 140, but may also send a sequenced version of that message back to the gateway 120-1 and the other gateways 120. Accordingly, if any subsequent message generated in a core compute node 140 references that sequence number, any gateway 120 may easily identify the associated message originally generated by gateway 120-1 by its sequence number.

Similarly, in some further embodiments, a sequenced version of an outgoing message generated by and sent from a core compute node 140 to gateways 120, and sequenced by sequencer 150, may be forwarded by sequencer 150 both to gateways 120 and back to core compute nodes 140.

Some embodiments may include multiple sequencers 150 for high availability, for example, to ensure that another sequencer is available if the first sequencer fails. For embodiments with multiple sequencers 150 (e.g., a currently active sequencer 150-1, and one or more standby sequencers 150-2, . . . , 150-s), the currently active sequencer 150-1 may maintain a system state log (not shown) of all the messages that passed through sequencer 150-1, as well as the messages' associated sequence numbers. This system state log may be continuously or periodically transmitted to the standby sequencers to provide them with requisite system state to allow them to take over as an active sequencer, if necessary.

The system state log may also be continually or periodically replicated to one or more sequencers in a standby replica electronic trading system (not shown in detail) at a disaster recovery site 155, thereby allowing electronic trading to continue with the exact same state at the disaster recovery site 155, should the primary site of system 100 suffer catastrophic failure.

In some embodiments, the system state log may also be provided to a drop copy service 152, which may be implemented by one or more of the sequencers, and/or by one or more other nodes in the electronic trading system 100. The drop copy service 152 may provide a record of daily trading activity through electronic trading system 100 that may be delivered to regulatory authorities and/or clients, who may, for example, be connected via participant devices 130. In alternate embodiments, the drop copy service 152 may be implemented on one or more gateways 120. Furthermore, in addition to or instead of referencing the system state log, the drop copy service 152 may provide the record of trading activity based on the contents of incoming and outgoing messages sent throughout electronic trading system 100. For example, in some embodiments, a gateway 120 implementing the drop copy service 152 may receive from the sequencer 150 (and/or from core compute nodes 140 and other gateways 120) all messages exchanged throughout the electronic trading system 100. A participant device 130 configured to receive the record of daily trading activity from the drop copy service 152 may not necessarily also be sending orders to and utilizing a matching function of electronic trading system 100.

Messages exchanged between participant devices 130 and gateways 120 may be according to any suitable protocol that may be used for financial trading (referred to for convenience as a “financial trading protocol”). For example, the messages may be exchanged according to custom protocols or established standard protocols, including both binary protocols (such as Nasdaq OUCH and NYSE UTP), and text-based protocols (such as NYSE FIX CCG). In some embodiments, the electronic trading system 100 may support exchanging messages simultaneously according to multiple financial trading protocols, including multiple protocols simultaneously on the same gateway 120. For example, participant devices 130-1, 130-2, and 130-3 may simultaneously have established trading connections and may be exchanging messages with gateway 120-1 according to Nasdaq OUCH, NYSE UTP, and NYSE FIX CCG, respectively.

Furthermore, in some embodiments, the gateways 120 may translate messages according to a financial trading protocol received from a participant device 130 into a normalized message format used for exchanging messages among nodes within the electronic trading system 100. The normalized trading format may be an existing protocol or may generally be of a different size and data format than that of any financial trading protocol used to exchange messages with participant devices 130. For example, the normalized trading format, when compared to a financial trading protocol of the original incoming message received at the gateway 120 from a participant 130, may include in some cases one or more additional fields or parameters, may omit one or more fields or parameters, and/or each field or parameter of a message in the normalized format may be of a different data type or size than the corresponding message received at gateway 120 from the participant device 130. Similarly, in the other direction, gateways 120 may translate outgoing messages generated in the normalized format by electronic trading system 100 into messages in the format of one or more financial trading protocols used by participant devices 130 to communicate with gateways 120.
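
As a purely illustrative sketch of such a translation, the snippet below maps a FIX-like tag/value dictionary onto a hypothetical normalized structure; the normalized field names are assumptions and do not reflect any particular exchange's internal format:

    from dataclasses import dataclass

    @dataclass
    class NormalizedOrder:
        client_order_id: str
        symbol: str
        side: str          # "buy" or "sell"
        quantity: int
        price: float

    def normalize_fix_like(fields: dict) -> NormalizedOrder:
        """Map text-protocol fields (a FIX-like tag/value dict) onto the internal
        normalized structure; extra optional tags are simply ignored."""
        return NormalizedOrder(
            client_order_id=fields["11"],                       # ClOrdID-style tag
            symbol=fields["55"],
            side="buy" if fields["54"] == "1" else "sell",
            quantity=int(fields["38"]),
            price=float(fields["44"]),
        )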

In the era of high-speed trading, in which microseconds or even nanoseconds are consequential, participants 130 exchanging messages with the electronic trading system 100 are often very sensitive to latency, preferring low, predictable latency. The arrangement shown in FIG. 1 accommodates this requirement by providing a point-to-point mesh 172 architecture between at least each of the gateways 120 and each of the compute nodes 140. In some embodiments, each gateway 120 in the mesh 172 may have a dedicated high-speed direct connection 180 to the compute nodes 140 and the sequencers 150.

For example, dedicated connection 180-1-1 is provided between gateway 1 120-1 and core compute node 1 140-1, dedicated connection 180-1-2 between gateway 1 120-1 and compute node 2 140-2, and so on, with example connection 180-g-c provided between gateway 120-g and compute node 140-c, and example connection 180-s-c provided between sequencer 150 and core c 140-c.

It should be understood that each dedicated connection 180 in the mesh 172 is, in some embodiments, a point-to-point direct connection that does not utilize a shared switch. A dedicated or direct connection may be referred to interchangeably herein as a direct or dedicated “link” and is a direct connection between two end points that is dedicated (e.g., non-shared) for communication therebetween. Such a dedicated/direct link may be any suitable interconnect(s) or interface(s), such as disclosed further below, and is not limited to a network link, such as a wired Ethernet network connection or other type of wired or wireless network link. The dedicated/direct connection/link may be referred to herein as an end-to-end path between the two end points. Such an end-to-end path may be a single connection/link or may include a series of connections/links; however, bandwidth of the dedicated/direct connection/link in its entirety, that is, from one end point to another end point, is non-shared and neither bandwidth nor latency of the dedicated/direct connection/link can be impacted by resource utilization of element(s) if so traversed. For example, the dedicated/direct connection/link may traverse one or more buffer(s) or other elements that are not bandwidth or latency impacting based on utilization thereof. The dedicated/direct connection/link would not, however, traverse a shared network switch as such a switch can impact bandwidth and/or latency due to its shared usage.

For example, in some embodiments, the dedicated connections 180 in the mesh 172 may be provided in a number of ways, such as a 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, InfiniBand, Peripheral Component Interconnect-Express (PCIe), RapidIO, Small Computer System Interface (SCSI), FireWire, Universal Serial Bus (USB), High Definition Multimedia Interface (HDMI), or custom serial or parallel busses.

Therefore, although the compute engines 140, gateways 120, sequencers 150 and other components may sometimes be referred to herein as “nodes”, the use of terms such as “compute node” or “gateway node” or “sequencer node” or “mesh node” should not be interpreted to mean that particular components are necessarily connected using a network link, since other types of interconnects or interfaces are possible. Further, a “node,” as disclosed herein, may be any suitable hardware, software, firmware component(s), or combination thereof, configured to perform the respective function(s) set forth for the node. As explained in more detail below, a node may be a programmed general purpose processor, but may also be a dedicated hardware device, such as a field programmable gate array (FPGA), application specific integrated circuit (ASIC), or other hardware device or group of devices, logic within a hardware device, printed circuit board (PCB), or other hardware component.

It should be understood that nodes disclosed herein may be separate elements or may be integrated together within a single element, such as within a single FPGA, ASIC, or other element configured to implement logic to perform the functions of such nodes as set forth herein. Further, a node may be an instantiation of software implementing logic executed by a general purpose computer and/or any of the foregoing devices.

Conventional approaches to connecting components, such as the compute engines 140, gateways 120, and sequencers 150, through one or more shared switches do not provide the lowest possible latency. These conventional approaches also result in unpredictable spikes in latency during periods of heavier message traffic.

In an example embodiment, dedicated connections 180 are also provided directly between each gateway 120 and each sequencer 150, and between each sequencer 150 and each core compute node 140. Furthermore, in some embodiments, dedicated connections 180 are provided among all the sequencers, so that an example sequencer 150-1 has a dedicated connection 180 to each other sequencer 150-2, . . . , 150-s. While not pictured in FIG. 1, in some embodiments, dedicated connections 180 may also be provided among all the gateways 120, so that each gateway 120-1 has a dedicated connection 180 to each other gateway 120-2, . . . , 120-g. Similarly, in some embodiments, dedicated connections 180 are also provided among all the compute nodes 140, so that an example core compute node 140-1 has a dedicated connection 180 to each other core compute node 140-2, . . . , 140-c.

It should also be understood that a dedicated connection 180 between two nodes (e.g., between any two nodes 120, 150, or 140) may in some embodiments be implemented as multiple redundant dedicated connections between those same two nodes, for increased redundancy and reliability. For example, the dedicated connection 180-1-1 between gateway 120-1 and core compute node 140-1 (e.g., Core 1) may actually be implemented as a pair of dedicated connections.

In addition, according to some embodiments, any message sent out by a node is sent out in parallel to all nodes directly connected to it in the point-to-point mesh 172. Each node in the mesh 172 may determine for itself, for example, based on the node's configuration, whether to take some action upon receipt of a message, or whether instead simply to ignore the message. In some embodiments, a node may never completely ignore a message; even if the node, due to its configuration, does not take substantial action upon receipt of a message, it may at least take minimal action, such as consuming any sequence number assigned to the message by the sequencer 150. That is, in such embodiments, the node may keep track of a last received sequence number to ensure that when the node takes more substantial action on a message, it does so in proper sequenced order.
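
A minimal sketch of this "consume the sequence number even when ignoring the payload" behavior, with hypothetical names and an assumed strictly incrementing delivery:

    class SequenceTracker:
        def __init__(self):
            self.last_seq = 0

        def on_sequenced_message(self, seq, payload, interested):
            # Minimal action taken by every node: consume the sequence number so
            # that later, substantive processing happens in proper sequenced order.
            if seq != self.last_seq + 1:
                raise RuntimeError("message received out of sequence")
            self.last_seq = seq
            if interested:
                self.process(payload)        # substantial action only if configured

        def process(self, payload):
            print("processing", payload)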

For example, a message containing an order to “Sell 10 shares of Microsoft at $190.00” might originate from participant device 130-1, such as a trader's personal computer, and arrive at gateway 120-1 (i.e., GW 1). That message will be sent to all core compute nodes 140-1, 140-2, . . . , 140-c even though only core compute node 140-2 is currently performing matching for Microsoft orders. All other core compute nodes 140-1, 140-3, . . . , 140-c may upon receipt ignore the message or only take minimal action on the message. For example, the only action taken by 140-1, 140-3, . . . , 140-c may be to consume the sequence number assigned to the message by the sequencer 150-1. That message will also be sent to all of the sequencers 150-1, 150-2, . . . , 150-s even though a single sequencer (in this example, sequencer 150-1) is the currently active sequencer servicing the mesh. The other sequencers 150-2, . . . , 150-s also receive the message to allow them the opportunity to take over as the currently active sequencer should sequencer 150-1 (the currently active sequencer) fail, or if the overall reliability of the electronic trading system 100 would increase by moving to a different active sequencer. One or more of the other sequencers (sequencer 150-2, for example) may also be responsible for relaying system state to the disaster recovery site 155. The disaster recovery site 155 may include a replica of electronic trading system 100 at another physical location, the replica comprising physical or virtual instantiations of some or all of the individual components of electronic trading system 100.

By sending each message out in parallel to all directly connected nodes, the system 100 reduces complexity and also facilitates redundancy and high availability. If all directly connected nodes receive all messages by default, multiple nodes can be configured to take action on the same message in a redundant fashion. Returning to the example above of the order to “Sell 10 shares of Microsoft at $190.00”, in some embodiments, multiple core compute nodes 140 may simultaneously perform matching for Microsoft orders. For example, both core compute node 140-1 and core compute node 140-2 may simultaneously perform matching for Microsoft messages, and may each independently generate, after having received the incoming message of the “Sell” order, a response message such as an acknowledgement or execution message that each of core compute node 140-1 and core compute node 140-2 sends to the gateways 120 through the sequencer(s) 150 to be passed on to one or more participant devices 130.

Because of the strict ordering and state determinism assured by the sequencer(s) 150, it is possible to guarantee that each of the associated response messages independently generated by and sent from the core compute nodes 140-1 and 140-2 are substantially equivalent; accordingly, the architecture of electronic trading system 100 readily supports redundant processing of messages, which increases the availability and resiliency of the system. In such embodiments, gateways 120 may receive multiple associated outgoing messages from core compute nodes 140 for the same corresponding incoming message. Because these multiple associated response messages are guaranteed to be equivalent, the gateways 120 may simply process only the first received outgoing message, ignoring subsequent associated outgoing messages corresponding to the same incoming message. In some embodiments, the “first” and “subsequent” messages may be identified by their associated sequence numbers, as such messages are sequence-marked messages. Allowing the gateways 120 to take action on the first of several associated response messages to reach them may therefore also improve the overall latency of the system.

Such a point-to-point mesh 172 architecture of system 100, besides supporting low, predictable latency and redundant processing of messages, also provides for built-in redundant, multiple paths. As can be seen, there exist multiple paths between any gateway 120 and any compute node 140. Even if a direct connection 180-1-1 between gateway 120-1 and compute node 140-1 becomes unavailable, communication is still possible between those two elements via an alternate path, such as by traversing one of the sequencers 150 instead. Thus, more generally speaking, there exist multiple paths between any node and any other node in the mesh 172.

Furthermore, this point-to-point mesh architecture inherently supports another important goal of a financial trading system, namely, fairness. The point-to-point architecture with direct connections between nodes ensures that the path between any gateway 120 and any core compute node 140, or between the sequencer 150 and any other node, has identical or at least very similar latency. Therefore, two incoming messages sent out to the sequencer 150 at the same time from two different gateways 120 should reach the sequencer 150 substantially simultaneously. Similarly, an outgoing message being sent from a core compute node 140 is sent to all gateways 120 simultaneously, and should be received by each gateway at substantially the same time. Because the topology of the point-to-point mesh does not favor any single gateway 120, chances are minimized that being connected to a particular gateway 120 may give a participant device 130 an unfair advantage or disadvantage.

Additionally, the point-to-point mesh architecture of system 100 allows for easily reconfiguring the function of a node, that is, whether a node is currently serving as a gateway 120, core compute node 140 or sequencer 150. It is particularly easy to perform such reconfiguration in embodiments in which each node has a direct connection between itself and each other node in the point-to-point mesh. When each node is connected via a direct connection to each other node in the mesh, no re-wiring or re-cabling of connections 180 (whether physical or virtual) within the point-to-point mesh 172 is required in order to change the function of a node in the mesh (for example, changing the function of a node from a core compute node 140 to a gateway 120, or from a gateway 120 to a sequencer 150). In such embodiments, the reconfiguration required that is internal to the mesh 172 may be easily accomplished through configuration changes that are carried out remotely. In the case of a node being reconfigured to serve as a new gateway 120 or being reconfigured from serving as a gateway 120 to another function, there may be some ancillary networking changes required that are external to the mesh 172, but the internal wiring of the mesh may remain intact.

Accordingly, in some embodiments, the reconfiguration of the function of a node may be accomplished live, even dynamically, during trading hours. For example, due to changes in characteristics of the load of electronic trading system 100 or new demand, it may be useful to reconfigure a core compute node 140-1 to instead serve as an additional gateway 120. After some possible redistribution of state or configuration to other compute nodes 140, the new gateway 120 may be available to start accepting new connections from participant devices 130.

In some embodiments, lower-speed, potentially higher latency shared connections 182 may be provided among the system components, including among the gateways 120 and/or the core compute nodes 140. These shared connections 182 may be used for maintenance, control operations, management operations, and/or similar operations that do not require very low latency communications, in contrast to messages related to trading activity carried over the dedicated connections 180 in the mesh 172. Shared connections 182, carrying non-trading traffic, may be over one or more shared networks and via one or more network switches, and nodes in the mesh may be distributed among these shared networks in different ways. For example, in some embodiments, gateways 120 may all be in a gateway-wide shared network 182-g, compute nodes 140 may be in their own respective compute node-wide shared network 182-c, and sequencers 150 may be in their own distinct sequencer-wide shared network 182-s, while in other embodiments all the nodes in the mesh may communicate over the same shared network for these non-latency sensitive operations.

Distributed computing environments such as electronic trading system 100 sometimes rely on high resolution clocks to maintain tight synchronization among various components. To that end, one or more of the nodes 120, 140, 150 might be provided with access to a clock, such as a high-resolution GPS clock 195 in some embodiments.

For purposes of the following discussion, gateways 120, compute nodes 140, and sequencers 150 connected in the mesh 172 may be referred to as “Mesh Nodes”. FIG. 2 illustrates an example embodiment of a Mesh Node 200 in the point-to-point mesh 172 architecture of electronic trading system 100. Mesh Node 200 could represent a gateway 120, a sequencer 150, or a core compute node 140, for example. Although in this example, functionality in the Mesh Node 200 is distributed across both hardware and software, Mesh Node 200 may be implemented in any suitable combination of hardware and software, including pure hardware and pure software implementations, and in some embodiments, any or all of gateways 120, compute nodes 140, and/or sequencers 150 may be implemented with commercial off-the-shelf components.

In the embodiment illustrated by FIG. 2, in order to achieve low latency, some functionality is implemented in hardware in Fixed Logic Device 230, while other functionality is implemented in software in Device Driver 220 and Mesh Software Application 210. Fixed Logic Device 230 may be implemented in any suitable way, including an Application-Specific Integrated Circuit (ASIC), an embedded processor, or a Field Programmable Gate Array (FPGA). Mesh Software Application 210 and Device Driver 220 may be implemented as instructions executing on one or more programmable data processors, such as central processing units (CPUs). Different versions or configurations of Mesh Software Application 210 may be installed on Mesh Node 200 depending on its role. For example, based on whether Mesh Node 200 is acting as a gateway 120, sequencer 150, or core compute node 140, a different version or configuration of Mesh Software Application 210 may be installed.

While any suitable physical communications link layer may be employed (including USB, Peripheral Component Interconnect (PCI)-Express, High Definition Multimedia Interface (HDMI), 10 Gigabit Ethernet (GigE), 25 GigE, 40 GigE, 100 GigE, or InfiniBand (IB), over fiber or copper cables), in this example, Mesh Node 200 has multiple low latency 10 Gigabit Ethernet SFP+ connectors (interfaces) 270-1, 270-2, 270-3, . . . , 270-n (known collectively as connectors 270). Connectors 270 may be directly connected to other nodes in the point-to-point mesh via dedicated connections 180, connected via shared connections 182, and/or connected to participant devices 130 via a gateway 120, for example. These connectors 270 are electronically coupled in this example to 10 GigE MAC Cores 260-1, 260-2, 260-3, . . . , 260-n (known collectively as GigE Cores 260), respectively, which in this embodiment are implemented by Fixed Logic Device 230 to ensure minimal latency. In other embodiments, 10 GigE MAC Cores 260 may be implemented by functionality outside Fixed Logic Device 230, for example, in PCI-E network interface card adapters.

In some embodiments, Fixed Logic Device 230 may also include other components. In the example of FIG. 2, Fixed Logic Device 230 also includes a Fixed Logic 240 component. In some embodiments, Fixed Logic component 240 may implement different functionality depending on the role of Mesh Node 200, for example, whether it is a gateway 120, sequencer 150, or core compute node 140. Also included in Fixed Logic Device 230 is Fixed Logic Memory 250, which may be a memory that is accessed with minimal latency by Fixed Logic 240. Fixed Logic Device 230 also includes a PCI-E Core 235, which may implement PCI Express functionality. In this example, PCI Express is used as a conduit mechanism to transfer data between hardware and software, or more specifically, between Fixed Logic 240 and the Mesh Software Application 210, via Device Driver 220 over PCI Express Bus 233. However, any suitable data transfer mechanism between hardware and software may be employed, including Direct Memory Access (DMA), shared memory buffers, or memory mapping.

In some embodiments, Mesh Node 200 may also include other hardware components. For example, depending on its role in the electronic trading system 100, Mesh Node 200 in some embodiments may also include High-Resolution Clock 195 (also illustrated in and discussed in conjunction with FIG. 1) used in the implementation of high-resolution clock synchronization among nodes in electronic trading system 100. A Dynamic Random-Access Memory (DRAM) 280 may also be included in Mesh Node 200 as an additional memory in conjunction with Fixed Logic Memory 250. DRAM 280 may be any suitable volatile or non-volatile memory, including one or more random-access memory banks, hard disk(s), and solid-state disk(s), and accessed over any suitable memory or storage interface.

Quality of Service Shaper on Ingress

As mentioned above, the architecture of system 100 inherently supports another important goal of a distributed processing system, namely controlling the rate(s) at which incoming messages can be received by the system.

Limiting the per-client inbound message rate also helps ensure fair provisioning of computing resources, so that a single client's excessive use of resources cannot overwhelm the system to such an extent that it prevents other clients from interacting with the system.

When possible, the control should be over both the sustained inbound message rate as well as a burst rate.

In addition to controlling the message ingress rate on a per-client basis, it may also be desirable to have system-wide control of the overall ingress rate across all client connections. This system-wide control ensures that the distributed system as a whole can maintain the required levels of service, including offering a predictable latency level for all its clients.

More particularly, FIG. 3 shows a more detailed view of a flow control portion 300 of an example gateway node 120 that was described in connection with the electronic trading system 100 of FIGS. 1 and 2. By way of review, incoming messages enter the distributed system 100 in an ingress direction over connections established between clients (such as one or more participant devices 130) and the gateway node 120.

The flow control 300 includes a per-connection queue 310, a per-connection QoS shaper 320, a round robin arbiter 330, and a QoS parameter store 340. In some embodiments, the messages discussed herein are typically application level messages, such as requests to make a trade in an electronic trading system. As such, multiple application messages may be contained in a single inbound data structure such as a TCP packet. As will be discussed in detail below, traffic rate-shaping is performed at the message level, but the flow control is implemented at some other level, such as at the per-connection level (for example, by controlling the TCP window size for each connection).

The per-connection queue 310 may include a set of FIFO queues 312-1, 312-2, . . . , 312-n, with a FIFO 312 associated with a corresponding connection 131-1, 131-2, . . . , 131-n. The per-connection queue 310 is responsible for holding incoming messages as they arrive before they are serviced by the rest of the distributed system. As a message is dequeued from the per-connection queue 310 to be serviced by the rest of the system 100, it enters a QoS shaper 320.

The QoS shaper 320 provides a corresponding set of token buckets 322-1, 322-2, . . . , 322-n, with one token bucket 322 associated with each client connection 131. The token bucket 322 for a given connection 131 enforces the configured sustained flow rate and burst flow rate of that connection 131. For example, if the participant (client) 130 has not sent any messages in a while, the token bucket 322 allows the message to pass straight through. However, token bucket 322 instead throttles the client connection 131, as explained below, if the client has sent messages too quickly.

In other words, messages received from client connection 131-1 feed into FIFO 312-1 and then into token bucket 322-1. Messages received from client connection 131-2 feed into FIFO 312-2 and then into token bucket 322-2. Messages from connection 131-n feed into FIFO 312-n and then into token bucket 322-n. As explained below in more detail, messages from the token buckets 322-1, 322-2, . . . , 322-n aggregate into the round robin arbiter 330.

A token may be considered to act as a “ticket” that allows a single message to pass through a token bucket 322. If one or more tokens are in the bucket, a message may consume one token from the bucket and pass straight through. When a message passes through a token bucket 322, the token consumed by the message is removed from the bucket.

Therefore, the burst rate is determined by the number of tokens that a token bucket 322 can hold. The sustained rate is the rate at which tokens may be added to the token bucket, which also corresponds to the maximum possible sustained “drain” rate of the bucket (if a message passing straight through the bucket is considered to be ‘draining’ from the bucket).
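
A minimal token-bucket sketch consistent with this description, in which the bucket capacity bounds the burst rate and the refill rate bounds the sustained rate (class and parameter names are illustrative, not the actual shaper 320):

    import time

    class TokenBucket:
        """Burst rate is bounded by `capacity` (maximum tokens held); sustained
        rate is bounded by `refill_rate` (tokens added per second)."""
        def __init__(self, capacity: float, refill_rate: float):
            self.capacity = capacity
            self.refill_rate = refill_rate
            self.tokens = capacity
            self.last_refill = time.monotonic()

        def _refill(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.refill_rate)
            self.last_refill = now

        def try_consume(self) -> bool:
            """Return True if one message may pass straight through (one token consumed)."""
            self._refill()
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False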

In this embodiment, with a round-robin arbiter 330 on the output end of the token buckets 322, by saying that a message ‘passes straight through’ the token bucket, it is meant that the message is immediately available to be pulled out by the arbiter 330. In other words, the round-robin arbiter 330 is continually cycling through each per-connection token bucket 322 to see, for each token bucket 322, if there is any message ready to be passed straight through.

The token buckets 322 assist with ensuring fairness, by not allowing a single connection 131 or set of connections to overly consume the resources represented by the cores 140. Yet token buckets 322 also reward “good behavior”, meaning that if a client 130 matches its sending rate to the sustained flow rate of its assigned token bucket 322, its messages should pass straight through with no or minimal latency impact.

The token buckets 322 may thus be managed in the following manner:

a) Tokens are added to each bucket at program-selected intervals, ranging from a few microseconds up to a second;

b) The number of tokens added at each interval may be controlled, e.g., program-selected; and

c) The maximum number of tokens in each bucket is also programmable.

It should be noted that a) and b) together correspond to the desired sustained rate, and c) corresponds to the burst rate.

The net effect is that:

a) The maximum number of orders per period per client may be individually selected;

b) The maximum burst order rate may also be selectable per client;

c) The overall order rate may also be selectable per client; and

d) The overall system order rate may also be selectable.

One or more of these parameters 340 may be tuned as one possible way to deal with a situation where it becomes necessary to slow down the flow of messages coming in from one or more connections 131. More particularly, if significant backpressure is detected for a connection (for example, the FIFO queue 312 for a given connection 131 is filling up beyond a certain point), a feedback mechanism may be used by which the gateway 120 notifies the corresponding client 130 to slow down its message transmission rate.

As will be explained in more detail with FIGS. 7 and 8 below, in one example implementation, if the client connection 131 is established over TCP, the gateway 120 could reduce the advertised TCP receive window size for that connection. Explicitly reducing the TCP window size is but one way to control the rate at which messages are received, however. Another way to control the sustained rate and burst rate when a connection 131 needs to be slowed down is by explicitly tuning the token bucket parameters 340 themselves, specifying a burst rate and/or sustained rate for that connection 131. Lowering the burst rate and/or sustained rate may end up eventually reducing the TCP window size, as an artifact of how the TCP protocol itself works (such as when TCP detects that an application is not servicing packets at a sufficiently fast rate). It may be possible to use either approach, or a combination of both approaches.
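
The fragment below sketches one possible feedback policy along these lines; the watermark value and function name are hypothetical, and it assumes a token bucket object like the TokenBucket sketch shown earlier:

    HIGH_WATERMARK = 1000       # hypothetical per-connection FIFO depth threshold

    def apply_backpressure(fifo_depth, bucket):
        """If a connection's FIFO is filling up, slow that client down by tuning
        its token bucket parameters. An alternative, not shown here, is to stop
        reading from the connection's TCP socket; the unread data then fills the
        kernel receive buffer, and the advertised TCP window shrinks as a
        consequence of ordinary TCP flow control."""
        if fifo_depth > HIGH_WATERMARK:
            bucket.refill_rate *= 0.5                        # halve the sustained rate
            bucket.capacity = max(1.0, bucket.capacity / 2)  # halve the burst allowance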

In some embodiments, there may be an advantage to tuning the QoS token bucket parameters 340, rather than adjusting the TCP window size. By specifying the token bucket parameters in units of messages, rather than having the limit specified in units of bytes (as a TCP window size would be), the system 100 becomes “protocol agnostic”. That is, by adjusting the token bucket parameters directly, the system 100 would not favor protocols which tend to use larger or smaller message sizes over other protocols. For example, FIX messages may tend to be larger than messages encoded with binary protocols, but by controlling the rate of ingress flow on a per-message basis, clients 130 sending messages over a FIX connection 131 are not penalized for choosing a protocol with larger overall message sizes. Even within the same message type in the same protocol, there could be message size variability, through the use of optional tags or parameters by the client 130. For example, in FIX, a client 130 may choose to send larger FIX messages by including additional information in optional tags, and it may not be desirable to penalize such clients with regards to flow control if the client chose to include more information in their messages.

In some implementations, a round-robin arbiter 330 (which may also have its own internal FIFO queue(s)) is located downstream from the QoS shaper 320. Arbiter 330 cycles through the output of the set of token buckets 322-1, 322-2, . . . , 322-n in a round-robin fashion, pulling a message out from a token bucket, and then forwarding the message to be serviced by the rest of the distributed system, such as by one or more of the cores 140.
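
One way such an arbiter might be sketched, reusing the TokenBucket sketch above and one deque-based FIFO per connection (names are illustrative, not the actual arbiter 330):

    class RoundRobinArbiter:
        def __init__(self, shapers):
            # shapers: list of (fifo, token_bucket) pairs, one per connection
            self.shapers = shapers
            self.index = 0

        def next_message(self):
            """Visit each connection's shaper once per call, in round-robin order,
            and return the first message whose token bucket permits it."""
            for _ in range(len(self.shapers)):
                fifo, bucket = self.shapers[self.index]
                self.index = (self.index + 1) % len(self.shapers)
                if fifo and bucket.try_consume():
                    return fifo.popleft()      # forward to the cores / sequencer
            return None                        # nothing currently eligible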

Note also that, consistent with the example implementation of FIG. 1, the message may also be sent by the gateway to the sequencer 150. In other embodiments, the gateway 120 may only send the message to the sequencer 150, in which case the sequencer 150 may then forward the message to one or more of the cores 140.

The QoS parameter settings 340 provide inputs to control the behavior of the per-connection queue 310 and QoS shaper 320. These parameter settings 340 may specify a maximum depth for a corresponding connection's FIFO 312, and/or the size of its corresponding token bucket 322. The QoS parameter settings 340 may be applied on a per-connection basis, even if they may be specified on a system-wide, a per-gateway, a per-client and/or a per-connection basis. As already explained above, these QoS parameters control the burst rate and the sustained rate.
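
One illustrative way to group such settings is a small record applied per connection; the field names below are assumptions for the sketch, not the actual contents of the parameter store 340:

    from dataclasses import dataclass

    @dataclass
    class QosParameters:
        """Illustrative per-connection settings; rates are in messages, not bytes."""
        max_fifo_depth: int      # maximum depth of the connection's FIFO
        bucket_capacity: int     # burst rate: maximum tokens the bucket may hold
        sustained_rate: float    # tokens added per second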

More particularly, in some embodiments, the sustained rate and burst rate in the QoS shaper 320 may be configured on a per-client or per-connection basis, thereby allowing the provider of the distributed system to charge different clients varying amounts depending on the configured inbound rate for a given client connection 131.

Typically, if a client desires to increase its maximum ingress message rate into a financial matching engine, the client would be required to add additional connections 131 into system 100 to gain more access to the matching engine(s) provided by the nodes 140. These multiple connections not only take time (sometimes one day or more) to configure, but often require human intervention and coordination among various data center service providers, and are therefore also prone to human error. Allowing a client to dynamically adjust during the trading day its maximum inbound message rate over a single connection into the matching engine, without the need to create more connections, therefore provides the client additional flexibility and minimizes the risk of misconfiguration.

Nonetheless, it should be understood that each client 130 may possibly use more than one connection 131, or a given connection 131 may service more than one client 130. The “per-connection” queue 310 and the “per-connection QoS shaper” 320 may therefore, alternatively, be a “per-client” queue or “per-client” shaper in some embodiments.

FIG. 4 illustrates another aspect of a preferred implementation of the system 100. Here, each gateway 120-1, 120-2, . . . , 120-n has its own respective per-connection flow control 300-1, 300-2, . . . , 300-n, and there is also a global flow control 500 that may be resident within one or more of the sequencers 150. Global flow control 500 provides another system-wide QoS layer, with feedback to the gateways 120. This enables the system 100 to also slow down the gateways when necessary, helping to ensure that the system as a whole can always operate within the required service levels. The sequencer 150, its corresponding global flow control 500, and the core(s) 140 will be referred to herein as the distributed system core 420.

In some implementations, when a message enters the distributed system core 420 from a gateway node 120, it always first passes through a single node (e.g., the sequencer 150). This system-wide view presented to the sequencer 150 allows it to, when necessary, limit the overall rate at which messages across all gateways 120-1, 120-2, . . . , 120-n, and thus all connections, enter the distributed system core 420. Thus, via the global flow control 500 (as controlled by global QoS parameters 540), the overall incoming message rate for the system 100 as a whole may be controlled.

It should be understood that the global control QoS parameters 540 may be temporarily adjusted and tuned dynamically depending upon current conditions being experienced by system 100. For example, under a sudden period of heavy load, a catastrophic event, or a system failure, etc., the global parameters 540 may be accordingly adjusted.

FIG. 5 is a more detailed view of an example global flow control 500, which may include a message queue 510 comprising a set of per-gateway FIFOs 512-1, 512-2, . . . , 512-n, a global QoS shaper 520 which includes a set of per-gateway token buckets 522-1, 522-2, . . . , 522-n, a global arbiter 530, and global QoS parameters 540. These elements function similarly to the corresponding elements within the gateway flow control units 300, but instead operate to control the flow between the gateways 120 and the cores 140.
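
Because the structure mirrors the per-connection shaper, the same sketches can be reused keyed by gateway rather than by connection; the snippet below assumes the TokenBucket and RoundRobinArbiter sketches shown earlier are in scope and uses placeholder parameter values:

    from collections import deque

    # One FIFO and one token bucket per gateway, rather than per connection.
    per_gateway_fifos = {gw: deque() for gw in ("gateway-1", "gateway-2")}
    per_gateway_buckets = {gw: TokenBucket(capacity=500, refill_rate=10_000)
                           for gw in per_gateway_fifos}

    global_arbiter = RoundRobinArbiter(
        [(per_gateway_fifos[gw], per_gateway_buckets[gw]) for gw in per_gateway_fifos]
    )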

In some embodiments the rate limiting provided by the global flow 500 may take the form of just the FIFO queues 512 alone, while in other embodiments, global flow may also use the token buckets 522 for incoming rate shaping.

Global flow control 500 provides an additional advantage to the system in some circumstances. For example, if every client using the system 100 tends to operate near its assigned messaging rate, then even though no individual client exceeds its own token bucket, the overall sum of the rates might still exceed what the system can handle.

As explained above, every message entering the system 100 is expected to be forwarded to the sequencer, and hence arrive at the global queue 510. By controlling a sustained and/or burst rate at this point, the global flow control becomes a single “choke point” at which all inbound system messaging is controlled.

If the global flow control 500 becomes overloaded, it has a feedback path for adjusting flow control with the gateways. This is not normally expected to happen, since the system 100 should normally be designed to handle an expected peak number of incoming messages. In other words, the designer of the system 100 can determine the maximum needed provisioning for the cores, given a maximum incoming message rate for the clients.

Also, upon an indication that any one gateway starts to experience backpressure (such as due to congestion), the global flow control 500 can decide to slow down all of the gateways, and not just the one experiencing overload. This could yield fairer results as opposed to only slowing down the overloaded gateway.

Another observation about this approach is that the rate at which ingress traffic is allowed inherently controls the net egress traffic in the other direction (e.g., response messages flowing from the cores 140 to the clients 130). This may be the case in the context of a system such as a trading system where an ingress message typically generates a corresponding egress message. In other words, by controlling the rate at which trade orders are allowed to enter the system 100, there is inherent control over the rate at which the system 100 generates messages that represent the dispositions of those orders.

Controlling QOS on both a per-connection and an overall system basis also helps with an aspect of access fairness in an electronic trading system. In systems without such control, a client 130-1 using gateway 120-1 with three other “heavy traders” may not be given as much access as another client 130-2 who is the only client connected to a second gateway 120-2. By instead servicing both the connections and the gateways on a round-robin basis, each client is given its fair share of access, and no one client will be “crowded out” by the others.

When gateways 120 are provisioned, their maximum sustained and burst rate may be configured such that any single gateway cannot overload the ability of the system 100 to process messages. This can be accomplished by appropriately setting the QoS parameters 340 and/or by limiting the speed at which individual connections can send messages.

Also, in some embodiments, a trading system 100 may preferably over-provision the capability of the set of cores 140 such that they will collectively always be guaranteed to easily handle far more messages than the maximum number of inbound trading messages from all clients and all gateways. This will also assist with ensuring fairness of access.

In some embodiments, the message rate limiting at the sequencer 150 (e.g., by the global flow control 500) is provided via a simple FIFO queue 535. In such an instance, the gateways 120 may detect backpressure (e.g., congestion) at the global level implicitly as the global queue 510 in the sequencer 150 fills up. In this instance, the gateways 120 may adjust their own QOS shaper 320 accordingly, to perhaps temporarily further limit the incoming flow of messages into the distributed system core 420.
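
One way a gateway could react to such implicitly detected backpressure is to scale its own permitted rate toward the sequencer down as the observed global queue fills. The sketch below shows one possible policy only; the function and parameter names are assumptions:

```python
def backpressure_adjusted_rate(queue_depth: int,
                               queue_capacity: int,
                               configured_rate: float) -> float:
    """Illustrative policy only: reduce the gateway's effective sustained
    rate linearly as the observed global queue 510 fills up."""
    headroom = max(0.0, 1.0 - queue_depth / queue_capacity)
    return configured_rate * headroom

# Example: with the global queue 75% full, a 1000 msg/s shaper is
# temporarily throttled down to 250 msg/s.
print(backpressure_adjusted_rate(750, 1000, 1000.0))  # 250.0
```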

In another embodiment, where the global flow 500 only uses a FIFO 510 and not also a QoS shaper 520, a request message may be sent back from global flow 500 to the gateways 120, such as over interface 182, to slow down when a nearly full FIFO 510 condition is detected. More particularly, such a message could indicate that the sender temporarily can no longer receive any messages or will soon exhaust its queue.

In general, the system 100 may be configured such that receiver nodes (which may be any one or more of the nodes in system 100) periodically communicate to the sender node(s) a special administrative type message (i.e., not a trading message) with an indication of how much more data (e.g., in units of trading messages, or bytes, or some other measurement) that receiver is capable of receiving. For example, the global flow 500 in sequencer 150 may periodically communicate back to the gateways 120 an indication of how much “room” it has to receive additional messages from the gateways 120. The QOS shaper 320 on the sender node (e.g., gateway 120) then adjusts its QOS parameter settings 340 appropriately.
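
As a purely illustrative sketch of such an administrative exchange (the message fields, names, and capping policy below are assumptions, not taken from the specification), a receiver could periodically advertise its remaining room, and the sender's shaper could cap its burst allowance to that value:

```python
from dataclasses import dataclass

@dataclass
class CreditUpdate:
    """Hypothetical administrative (non-trading) message: the receiver
    advertises how much more it can accept, here in units of messages."""
    sender: str    # e.g. "sequencer-150"
    credits: int   # remaining room at the receiver, in messages

def apply_credit(update: CreditUpdate, qos_params: dict) -> dict:
    """Illustrative sender-side handling (e.g. at a gateway): never allow
    a burst larger than the room the receiver just advertised."""
    adjusted = dict(qos_params)
    adjusted["burst"] = min(adjusted["burst"], update.credits)
    return adjusted

print(apply_credit(CreditUpdate("sequencer-150", 8), {"rate": 1000.0, "burst": 50}))
# {'rate': 1000.0, 'burst': 8}
```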

The gateway(s) 120 might also then propagate that information to be applied across all connections 131 on the gateway 120. The adjustments could involve, for example, making changes to each connection's TCP window size and/or adjusting the corresponding per-connection token bucket 322 parameters.

Similar administrative messages may be exchanged at other points in the system, for example, from the cores 140 to the sequencer 150 and gateways 120 for messages flowing in the inbound direction (from participants 130), and even in the other direction (e.g., outbound direction) for messages flowing from the sequencer 150 or the cores 140 to the gateways 120. In the case of congestion in the outbound direction, the QOS shapers 320 in the gateways and QOS shapers 520 in sequencer 150 (e.g., in the global flow control 500) may still adjust the corresponding QOS parameters 540 in the inbound direction.

In other embodiments with a more active form of rate limiting at the global flow control 500 in the sequencer 150, such as by using a QOS shaper 520 having a token bucket 522 per gateway, the global flow control 500 may proactively communicate back to the gateway nodes 120 to request that they temporarily slow down or even pause their flow of outgoing messages. This could be done, for example, by reducing the gateway(s)' flow to one-half of their usual permitted level. After the flow is adjusted, the global flow controller 500 may then indicate to the gateway nodes 120 to resume normal operation.

FIG. 6 illustrates one embodiment of a hierarchy of flow control messages that may be used in the system 100. As explained above, gateways 120 feed messages to the sequencer 150, which in turn forwards messages to the cores 140. At the various places in the system where queuing is possible, flow control is also implemented (in any of the ways already explained). Accordingly, for example, the shaper 320 in a gateway 120 may apply flow control to the clients 130; the shaper 520 in the sequencer 150 may apply flow control to the gateway(s) 120; and the cores 140 may apply flow control to the sequencer 150 (e.g., to its shaper 520).

A gateway 120 may adjust flow control for an individual client 130 that is causing an overload, or it can throttle back all clients 130 that it handles. Similarly, the sequencer can, in some embodiments, adjust flow control to an individual gateway 120 that is causing an overload, or it can throttle back all gateways 120 until the system-wide overload is cleared.

There are several ways to implement flow control, such as via a pause operation. This could be accomplished by pausing a clock that feeds the respective token buckets in a respective gateway shaper 320 or global shaper 520. When sufficient messages have been cleared, the flow control can be relieved by again enabling the token bucket clocks. Flow control can be applied dynamically (based on presently detected flow rates) or by setting fixed configuration parameters at the time the system is provisioned. In other embodiments where it is not possible to pause a clock, a pause operation may instead (as sketched after the list below):

-   set the TCP window size to 0,
-   set the token bucket parameters to 0, and/or
-   stop servicing messages from the various queues.
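
Such a pause, in this hypothetical form, could look like the following sketch; the connection object and its attribute names are illustrative assumptions only:

```python
from types import SimpleNamespace

def pause_connection(conn) -> None:
    """Illustrative pause combining the options listed above."""
    conn.tcp_window = 0        # advertise a zero TCP window
    conn.bucket_rate = 0.0     # stop refilling the connection's token bucket
    conn.bucket_tokens = 0.0
    conn.serviced = False      # stop draining this connection's FIFO

def resume_connection(conn, rate: float, burst: float, window: int) -> None:
    """Undo the pause once the congestion has cleared."""
    conn.tcp_window = window
    conn.bucket_rate = rate
    conn.bucket_tokens = burst
    conn.serviced = True

# Hypothetical connection state, paused and then resumed.
conn = SimpleNamespace(tcp_window=65535, bucket_rate=1000.0,
                       bucket_tokens=200.0, serviced=True)
pause_connection(conn)
resume_connection(conn, rate=1000.0, burst=200.0, window=65535)
```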

It should be understood that either egress processing or ingress processing or both may be paused. So, while pausing ingress processing, if egress is not paused, the system may still send outbound messages to the clients.

FIG. 7 shows a possible implementation for how flow control 300 may be applied in an example gateway 120. By way of review, messages flowing in from each client connection 131-1, 131-2, . . . , 131-n are placed in a corresponding one of the per-connection FIFOs 312-1, 312-2, . . . 312-n. For example, messages received on connection ‘conn 1’ 131-1 are placed into a FIFO queue 312-1. FIFO 312-1 has three messages queued up, but with room for a certain number (‘F1’) of additional messages that could fit before it fills up. A per-connection QOS shaper 322-1, which may be implemented as a token bucket that controls the burst rate and sustained rate of messages, pulls the next message from the FIFO queue 312-1 when it is time to let another message through. That time may be determined according to the per-connection QOS shaper's 322-1 configured rate settings 340.

From the per-connection QOS shaper 322-1, the message may then enter the arbiter 330 shared across all of the client connections 131. Arbiter 330 may also have its own FIFO queue 335. As illustrated, at the present time, the shared arbiter's FIFO queue 335 contains four (4) queued messages, with room for a certain number (‘A’) of additional messages before the shared arbiter's FIFO queue 335 would be full. The shared arbiter 330 emits one message at a time from among all the client connections, for example, in a round-robin fashion, and sends it to the distributed system core 420, where it may enter the sequencer 150 (and its corresponding global flow control 500).
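
The round-robin behaviour of the shared arbiter can be illustrated by the sketch below, in which each connection contributes at most one message per pass and only when its shaper grants a token. All names, and the trivial stand-in shaper used in the example, are assumptions rather than details from the specification:

```python
from collections import deque

def arbiter_round(fifos: dict, shapers: dict, emit) -> None:
    """One round-robin pass over all connections: forward at most one
    message per connection, and only if that connection's shaper allows it."""
    for conn_id in sorted(fifos):
        fifo = fifos[conn_id]
        if fifo and shapers[conn_id].try_consume():
            emit(fifo.popleft())

class AlwaysAllow:
    """Stand-in shaper for the example; a real shaper would be a token bucket."""
    def try_consume(self) -> bool:
        return True

fifos = {"conn-1": deque(["order-A", "order-B"]), "conn-2": deque(["order-C"])}
shapers = {"conn-1": AlwaysAllow(), "conn-2": AlwaysAllow()}
arbiter_round(fifos, shapers, emit=print)   # prints order-A then order-C
```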

As can be seen in FIG. 7, the per-connection TCP window size for ‘conn 1’ 130-1 is continuously determined as a function (for example, the ‘min’ function) of the number of additional messages (‘F1’) that can still fit in the per-connection FIFO queue 312-1 (that is, how much “ingress space” is still available for the connection) and the number of additional messages (‘A’) that can still fit in the shared arbiter's queue 335 (that is, how many more messages the gateway as a whole can handle). Applying a TCP “window squeeze” in this manner, e.g., reducing the TCP window size, is one way that the gateway can throttle back the rate at which connection 131-1 is sending inbound messages. The rate limiting input ‘S’, which may be determined by the sequencer 150, could be a constant across all gateways, or it could be a per-gateway value.
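
One plausible reading of this computation (the function and parameter names below are assumptions) is simply the minimum of the per-connection free space F1, the shared arbiter free space A, and the sequencer-supplied limit S, optionally reduced further by a client-specified cap as discussed next:

```python
from typing import Optional

def advertised_window(f_conn: int, a_shared: int, s_limit: int,
                      client_cap: Optional[int] = None) -> int:
    """Sketch of one possible window computation: free FIFO space for the
    connection, free space in the shared arbiter queue, and the
    sequencer-supplied limit, with an optional client-imposed cap."""
    window = min(f_conn, a_shared, s_limit)
    if client_cap is not None:
        window = min(window, client_cap)
    return window

print(advertised_window(f_conn=5, a_shared=3, s_limit=10))                 # 3
print(advertised_window(f_conn=5, a_shared=3, s_limit=10, client_cap=2))   # 2
```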

The client 130-1 using connection 131-1 can also specify a value smaller than F1 or A, such as in the event that the client may not itself be able to handle the minimum value.

In addition, there may be other mechanisms to adjust the flow rate of a connection beyond those mentioned above. For example, when an individual gateway 120 detects that it is overwhelmed, it can slow itself down, or it can send a message to the sequencer 150; in that approach, the sequencer's role is one among several mechanisms to provide feedback that ultimately throttles the per-client-connection rate shapers in the gateways.

Other methods and reasons to throttle client connections may include:

-   statically, because of connection classification or server classification (premier, beginner trial, onboarding, etc.), or
-   dynamically, due to backpressure from any of the connected clients.

As discussed in more detail elsewhere, the global flow controller 500 in the sequencer 150 might also provide feedback to slow down one, some, or all of the gateways 120 when the global aggregator 530, or any other per-node queue, starts to back up.

It may also be possible to have two aggregators in the sequencer 150, one for the gateway-to-core direction and a second for the core-to-gateway direction. In some embodiments, the global flow in sequencer 150 may have a single queue that aggregates messages from both the gateways and cores. But in other embodiments, the global flow 500 in sequencer 150 may have two queues, one that aggregates messages received from the gateways 120 and another to aggregate messages received from the cores 140.

Another embodiment may configure the compute nodes 140 to assist with congestion. When a particular compute node 140 becomes too busy, it may send a message to the gateway flow controller 300 or global flow controller 500 asking that the inbound data flow to it be slowed. If the message is sent first to the global flow controller 500 (or some other central authority), the sequencer 150 has the opportunity to decide whether the congestion in the requesting compute node 140 warrants slowing down the ingress flow, in which case it would then forward an equivalent message to the gateway(s) to tell those gateway(s) to slow down. The sequencer or other central authority may also determine that no slow down on ingress needs to take place, for example, if the response latency is not currently impacted or expected to be impacted, because compute nodes servicing the same symbols as the overwhelmed compute node are not currently experiencing any congestion.

In a case where the system 100 is a trading system, the sequencer may also respond to a compute node's 140-1 request to relieve congestion by reassigning symbols away from the congested node 140-1 to some other, less congested node 140-2.

In some scenarios, queue buildup on egress (e.g., the egress queue filling to capacity) can in turn affect ingress flow control. This is especially the case in some embodiments, such as an electronic trading system, where message flow may very well be asymmetric (e.g., the number of egress messages exceeds the number of ingress messages). This may occur in an electronic trading system when:

-   A switch dies, causing cancel on disconnect for many sessions. This will cause a storm of asynchronous cancel messages.
-   Crossing orders, where a single order with a large quantity can match against many counterparties.
-   Many orders with the same time-in-force expiration.
-   End of day, when many single day orders need cancelling.

The core nodes 140 could also back up, for example, if a large number of “fill” orders or “cancel on disconnect” messages are all generated. Those may have resulted from some infrastructure failure such as a failure of a switch between the clients and the system 100. A halt or circuit-breaker-like functionality may thus also be used to slow down the whole system 100 for a period of time when certain events occur (such as “market 8% down since open”), or at a certain time of day (for example, at lunch hour), or based on an IPO, or based on a holiday schedule.

As was described above in conjunction with FIG. 4, the sequencer 150 may also have a role in shaping the rate limit of each gateway 120, and may communicate feedback to the gateways 120 when significant system-wide congestion is detected. Thus, the functionality illustrated in FIG. 7 for calculating the per-connection TCP window size for a client connection can be extended to the global flow controller 500.

FIG. 8 is an example of how flow control may be applied by the global flow control 500 in an example sequencer 150. By way of review, messages flowing in from each gateway 120-1, 120-2, . . . , 120-n are placed in a corresponding one of the per-gateway FIFOs 512-1, 512-2, . . . 512-n. For example, messages received from gateway 120-1 are placed into a FIFO queue 512-1. FIFO 512-1 has three messages queued up, but with room for a certain number (‘F1’) of additional messages that could fit before it fills up. A global QOS shaper 522-1, which may be implemented as a token bucket that controls the burst rate and sustained rate of messages, pulls the next message from the FIFO queue 512-1 when it is time to let another message through. That time may be determined according to the global QOS shaper's 522-1 configured rate settings. However, in some embodiments, the global arbiter 530 may pull a message directly from a FIFO queue 512-1 if the QoS shaper 522 “gives it permission”, i.e., there is at least one token in the corresponding bucket 522-1. An analogous implementation is also possible for the per-connection token buckets 322 and arbiter 330 at the gateway flow control level 300.

From a global QOS shaper 522-1, the message may then enter the arbiter 530 shared across all of the gateways 120. Arbiter 530 may also have its own FIFO queue 535. As illustrated, at the present time, the global arbiter's FIFO queue 535 contains four (4) queued messages, with room for a certain number (‘A’) of additional messages before the global arbiter's FIFO queue 535 would be full. The shared arbiter 530 emits one message at a time from among all the gateways, for example, in a round-robin fashion, and sends it to the cores 140.

In one embodiment, there may not be a TCP window size adjustment for the connections between the sequencer 150 and the gateways 120. This is because the mesh 172 is likely to consist of direct, point-to-point connections that do not require the overhead of a protocol such as TCP.

To summarize, if the global flow 500 needs to reduce congestion, it can do one or more of the following (a sketch combining these options follows the list):

-   a) adjust the per-gateway token bucket parameters (if possible),
-   b) adjust the per-gateway-to-sequencer connection TCP window size for one or more gateways (if possible),
-   c) send a message to one or more gateways to have them adjust their flow, and in response to that message from the global flow 500, the one or more gateways may:
    -   i. adjust the TCP window size for all client-to-gateway connections (e.g., connections 131-1, 131-2, . . . , 131-n) on that gateway, and/or
    -   ii. adjust the per-client-connection token bucket parameters for all client-to-gateway connections on that gateway, and/or
-   d) pause one or more of the client-to-gateway connections (using the various pause options discussed above).
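
These options could be combined, for example, by a simple escalation policy at the sequencer. The sketch below is illustrative only; the gateway control stub, the severity measure, and the 0.9 threshold are all assumptions:

```python
from types import SimpleNamespace

def relieve_global_congestion(gateways: dict, severity: float) -> None:
    """Illustrative escalation: mild congestion halves each per-gateway
    bucket rate (option a); severe congestion asks every gateway to pause
    its client connections (options c/d)."""
    for gw in gateways.values():
        if severity < 0.9:
            gw.bucket_rate *= 0.5
        else:
            gw.send({"type": "PAUSE_CLIENT_CONNECTIONS"})

gateways = {"gw-1": SimpleNamespace(bucket_rate=500.0, send=print)}
relieve_global_congestion(gateways, severity=0.5)   # gw-1 rate becomes 250.0
```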

In some embodiments, it may not be the responsibility of the global flow 500 in the sequencer 150 to identify an individual client-to-gateway connection that might need to be slowed down, as it might be inefficient and/or difficult to do so. More likely, the global flow control 500 in the sequencer 150 would slow down all traffic from a single gateway (such as gateway 120-1), or the traffic from some subset of all gateways 120-1, 120-2, . . . 120-g, or perhaps even all gateways 120. Such flow control may be implemented in any of the ways discussed elsewhere herein.

Other Use Cases

The architecture described above may be of use in applications other than electronic trading systems. For example, it is possible that it may be used to monitor data streams flowing across a network, to capture packets, decode the packets' raw data, analyze packet content in real time, and provide responses, for applications other than handling securities trade orders.

Further Implementation Options

It should be understood that the example embodiments described above may be implemented in many different ways. In some instances, the various “data processors” may each be implemented by a physical or virtual general purpose computer having a central processor, memory, disk or other mass storage, communication interface(s), input/output (I/O) device(s), and other peripherals. The general purpose computer is transformed into the processors and executes the processes described above, for example, by loading software instructions into the processor, and then causing execution of the instructions to carry out the functions described.

As is known in the art, such a computer may contain a system bus, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The bus or busses are essentially shared conduit(s) that connect different elements of the computer system (e.g., one or more central processing units, disks, various memories, input/output ports, network ports, etc.) and that enable the transfer of information between the elements. One or more central processor units are attached to the system bus and provide for the execution of computer instructions. Also attached to the system bus are typically I/O device interfaces for connecting the disks, memories, and various input and output devices. Network interface(s) allow connections to various other devices attached to a network. One or more memories provide volatile and/or non-volatile storage for computer software instructions and data used to implement an embodiment. Disks or other mass storage provide non-volatile storage for computer software instructions and data used to implement, for example, the various procedures described herein.

Embodiments may therefore typically be implemented in hardware, custom designed semiconductor logic, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), firmware, software, or any combination thereof.

In certain embodiments, the procedures, devices, and processes described herein are a computer program product, including a computer readable medium (e.g., a removable storage medium such as one or more DVD-ROMs, CD-ROMs, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the system. Such a computer program product can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable, communication, and/or wireless connection.

Embodiments may also be implemented as instructions stored on a non-transient machine-readable medium, which may be read and executed by one or more processors. A non-transient machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a non-transient machine-readable medium may include read only memory (ROM); random access memory (RAM); storage including magnetic disk storage media; optical storage media; flash memory devices; and others.

Furthermore, firmware, software, routines, or instructions may be described herein as performing certain actions and/or functions. However, it should be appreciated that such descriptions contained herein are merely for convenience and that such actions in fact result from computing devices, processors, controllers, or other devices executing the firmware, software, routines, instructions, etc.

It also should be understood that the block and network diagrams may include more or fewer elements, be arranged differently, or be represented differently. But it further should be understood that certain implementations may dictate that the block and network diagrams, and the number of block and network diagrams illustrating the execution of the embodiments, be implemented in a particular way.

Accordingly, further embodiments may also be implemented in a variety of computer architectures, physical, virtual, cloud computers, and/or some combination thereof, and thus the computer systems described herein are intended for purposes of illustration only and not as a limitation of the embodiments.

The above description has particularly shown and described example embodiments. However, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the legal scope of this patent as encompassed by the appended claims.

1. A method of operating a distributed data processing system to control inbound flow of messages from a plurality of gateway nodes to a plurality of compute nodes and to a system-level node, wherein the distributed data processing system is an electronic trading system, and the messages are electronic trading messages, the method comprising: at each of the plurality of gateway nodes: receiving the messages over one or more client connections; controlling a sustained flow rate and/or a burst flow rate of the messages on a per-client or per-connection basis; and forwarding the messages to the system-level node; at the system-level node: receiving the messages from each of the plurality of gateway nodes; controlling a system-wide message flow rate, wherein controlling the system-wide message flow rate further comprises controlling a sustained flow rate and/or a burst flow rate on a per-gateway-node basis for each of the plurality of gateway nodes; and forwarding the messages to the compute nodes; at each of the plurality of compute nodes: receiving the messages from the system-level node; and operating on the messages to perform an electronic trading function; at a selected one of the compute nodes: generating a response message; and returning the response message to a selected one of the gateway nodes; and at the selected gateway node: returning the response message over at least one of the one or more client connections.
2. The method of claim 1 wherein the messages are application layer messages, in which multiple application layer messages are contained in a lower layer protocol packet; and wherein at least one of the steps of controlling the sustained flow rate and/or burst flow rate further comprises providing feedback to the lower layer protocol.
3. The method of claim 2 wherein the lower layer protocol is a transport layer protocol, and the feedback is provided by controlling a transport layer window size.
4. The method of claim 1 wherein the system-level node is a sequencer node, and wherein the method further comprises: at one or more of the gateway nodes: forwarding the messages to the sequencer node; at each of the compute nodes: receiving the messages from the sequencer node; and wherein operating on the messages to perform an electronic trading function further comprises operating on the messages received from the one or more gateway nodes and/or operating on the messages received from the sequencer node.
5. The method of claim 1 wherein at least one of the steps of controlling sustained flow rate and/or burst flow rate further comprises: queuing the messages with a plurality of queues; feeding the messages from the queues to a plurality of token buckets; and selecting the messages from the token buckets.
6. The method of claim 5 wherein the queues are FIFOs.
7. The method of claim 5 wherein the selecting is on a round-robin basis.
8. The method of claim 1 wherein the sustained flow rate and/or burst flow rate is further controlled on the per-client or per-connection basis in response to a client request.
9. The method of claim 1 wherein the system-wide message flow rate is further controlled by providing feedback to one or more of the gateway nodes.
10. The method of claim 9 where the step of providing feedback further comprises at least one of: lowering a TCP window size on a per-connection basis for all connections from the system-level node to the one or more of the gateway nodes, or lowering the sustained flow rate and/or the burst flow rate by adjusting at least one parameter of a per-connection token bucket for all connections from the system-level node to the one or more of the gateway nodes, or sending a feedback message to the one or more of the gateway nodes.
11. The method of claim 1 wherein the system-wide message flow rate is further controlled by pausing one or more of the gateway nodes.
12. The method of claim 11 wherein the step of pausing one or more of the gateway nodes further comprises at least one of: setting a TCP window size to zero for at least one client connection on the one or more of the gateway nodes, not adding new messages to at least one per-connection FIFO queue for the one or more of the gateway nodes, not servicing messages from at least one per-connection FIFO queue for the one or more of the gateway nodes, or setting at least one of the sustained flow rate and/or burst flow rate for at least one connection on the one or more gateway nodes to zero.
13. The method of claim 1 wherein controlling the sustained flow rate and/or burst flow rate on the per-client or per-connection basis further comprises: controlling a flow from one of the gateway nodes to a selected client and/or connection.
14. The method of claim 1 where the step of controlling the sustained flow rate and/or burst flow rate further comprises at least one of: lowering a TCP window size on a per-client or per-connection basis; or lowering the sustained flow rate and/or burst flow rate by adjusting at least one parameter of a per-client or per-connection token bucket.
15. The method of claim 1 wherein the sustained flow rate and/or burst flow rate on the per-client or per-connection basis is further controlled on the per-client or per-connection basis by pausing at least one client or connection.
16. The method of claim 15 wherein the step of pausing at least one client or connection further comprises at least one of: setting a TCP window size to zero on a per-client or per-connection basis, not adding new messages to a per-client or a per-connection FIFO for at least one of the gateway nodes, not servicing messages from a per-client or a per-connection FIFO queue, or setting at least one of the sustained flow rate or burst flow rate for at least one client or connection to zero.
17. The method of claim 1 additionally comprising: receiving the messages at the system-level node from the gateway nodes via a full mesh set of point-to-point direct connections.
18. The method of claim 1 additionally comprising: at the system-level node: receiving the messages from the compute nodes; and slowing a rate at which the messages are received from the gateway nodes, when the messages from the compute nodes are received at greater than a predetermined rate.
19. The method of claim 1 additionally comprising: at each of the compute nodes: slowing a rate at which the messages are received from the gateway nodes, when the messages received from or to be sent from the compute nodes to the gateway nodes exceed a rate greater than a predetermined rate.
20. An electronic trading system comprising: a plurality of gateway nodes configured to: receive messages over one or more client connections, wherein the messages are electronic trading messages; control a sustained flow rate and/or a burst flow rate of the messages on a per-client or per-connection basis; and forward the messages to a system-level node; the system-level node configured to: receive the messages from each of the gateway nodes; control a system-wide message flow rate for the system, and further to control a sustained flow rate and/or a burst flow rate on a per-gateway-node basis for each of the gateway nodes; and forward the messages to one or more compute nodes; the one or more of the compute nodes configured to: receive the messages from the system-level node; and operate on the messages to perform an electronic trading function; a selected one of the compute nodes further configured to: generate a response message; and return the response message to a selected one of the gateway nodes; and the selected one of the gateway nodes further configured to: return the response message over at least one of the one or more client connections.
21. (canceled)
22. (canceled)
23. The method of claim 1 further comprising: wherein controlling the system-wide message flow rate further comprises sending flow control messages from the system-level node to the plurality of gateway nodes.
24. The method of claim 1 wherein forwarding the messages to the system-level node further comprises forwarding the messages using a layered protocol; and receiving the messages from each of the plurality of gateway nodes further comprises receiving the messages using a layered protocol.
25. The method of claim 1 wherein forwarding the messages to the compute nodes further comprises forwarding the messages using a layered protocol; and receiving the messages from the system-level node further comprises receiving the messages using a layered protocol.